Assignment

A short description of the post.

Rhoda Tong https://public.tableau.com/app/profile/mt.tong/viz/DataVizMakeover2_16240837165630/TradeDB
06-26-2021

Background

Since 20 years ago, GASTech has been operating a natural gas production site in the island country of Kronos. The business has been profitable, and the company has also developed close relationships with the Kronos Government. In January 2014, following GASTech’s initial public offering listing, several GASTech employees has gone missing. An organization known as Protectors of Kronos (POK) is suspected in these cases of missing persons, as GASTech’s business moves had not been too environmentally friendly. A thorough investigation is to be carried out by the law enforcers of Kronos and Tethys to break this case.

Information and data pertaining to the whereabouts of company cars, purchases made by employees in local stores have been provided to the law enforcers. We shall use visual analytics to sense-make this data to facilitate the investigation.

This would be done as a sub-component which would eventually feed into an interactive Shiny app for the use of the law enforcers, together with other sub-components covered by my project group mates. The objective of this assignment is to explore what the insights are and how they can be brought out from the depths of this dataset to aid in the investigation.

Literature Review

Crime Analysis - Overall

Crime analysis is a law enforcement function which involves systematic analysis for identifying and analyzing patterns and trends in crime and disorder [Wikipedia]. Too little data would inevitably limit the efficiency of the investigation, but overwhelming volume of information could pose a huge challenge as well. Coupled with the need for rapid analysis, too much information to absorb, categorize, remember and draw meaning from could compromise the overall investigation [Crime-Analysis ScienceDirect]. Visual analytics techniques could be employed to gain useful insight from massive raw data.

For efficiency in data processing, information must first be consistent as subtle differences can greatly increase variability and reduce the reliability and value of a dataset [Crime-Analysis ScienceDirect]. Next, there cannot be information overload within a diagram. Good practices such as appropriate brushing and linking, selecting and marking, aggregation, elimination, virtual navigation techniques such as zooming, focus + context, and details-on-demand techniques have been studied and used to overcome an over-cluttered screen.[Ku et. al, 2016]. Uninteresting and expected patterns can also be unmarked to improve efficiency and reduce false positives. [Arxiv]

Approaches to Analyze Anomalous User Behaviors

According to [Arxiv, Visual of Anomalous User Behavior], detection of anomalous user behaviors can be a challenging task as the boundary between abnormal and normal data may not be clearly defined, and approaches like machine learning lack contextual information to support decision-making. Visualization techniques like sequence visualization, graph visualization, text visualization, geographic visualization, chart visualization can be combined with interaction methods like tracking and monitoring, pattern discovery, exploration and navigation to analyze anomalous user behaviors.

Analysis of anomalous travel behaviors can take the following approaches:

Analysis of anomalous transactions can take the following approaches:

Visualizing Movement / Geospatial Map Types

According to Robert Krueger [Year], movement data is more complex to handle than simple point-based data as it contains complex hierarchical structures of overlapping trajectories with diverse shapes and directions.

  1. Graphs (R. Krueger)

Movements can be spatially aggregated. A full spatial and temporal aggregation of the trajectories can result in a static graph G = (V,E) consisting of nodes V and edges E. Each edge e = (u,v) can encode directions and contain a weight that holds the travel volumes between the nodes. Analysis is flexible with this graph network and techniques such as clustering, segmentation, aggregation can be performed.

  1. Thematic Maps

This is the world famous cholera map produced by Doctor John Snow in 1854. Each bar plotted onto the map represents a death case. It was then immediately apparent where the deaths clustered, enabling investigation to be directed and focused. It eventually led them to the culprit water pump in Broad street which was polluted by sewage water tainted by a disposed baby nappy with cholera. Death statistics on their own might not have led to this discovery this soon had it not been geo-localized. This spatial autocorrelation is powerful.

  1. Spatial-Temporal Perspectives: Multiple Coordinated Views (MCV)

Visualizations can include many types such as thematic maps, scatter plots, parallel coordinate plots, timelines and a wide range of other techniques. Interactivity to allow quick switching between these views can facilitate more insights. These systems are described generally as coordinated-view visualizations. An example is shown as below. Global population trends by country are compared using a parallel coordinate plot, choropleth map and treemap. (Image courtesy of the National Center for Visual Analytics at Linkoping University)

  1. Spatial-Temporal Perspectives: 3D Coordinate System (Space-Time Cubes)

There can be perspective distortion and occlusion, but the spatio-temporal distribution can be highlighted. [R. Krueger]. Space-time cubes show change over time within geographic space. Each cube represents a slice of time, in which the topmost cube has the newest timestamp. Temporal changes in that geographic area can then be visualized. Map below shows a space-time cube web scene in ArcGIS Online (AGOL).

R Packages to Be Applied

  1. DT - To make data tables interactive
  2. plotly - For interactive charts
  3. mapview, tmap, sf, raster - to handle geospatial data and movement visualization
  4. lubridate, clock - to handle time and date values
  5. scales - To handle label formatting in charts
  6. naniar - For data exploration ; Check for missing values

crosstalk - To create links between objects for interactivity raster - to handle geospatial data sf - to handle geospatial data tmap - for map visualization clock - to take care of time value (newer r package compared to lubridate) lubridate - mapview - to visualize maps

raster, tmap, sf, clock (for movement visualization)

for geospatial data tmap for map visualization raster and sf for handling geospatial data can use clock package to suss out the day of week/day of year for our MC2 question.

Data Available

The following information is available.

  1. Employee that the company car is assigned to : Last Name, First Name, CarID, Current Employment Type, Current Employment Title [car-assignments.csv]
  2. GPS Tracking Info of Company Cars : Date, Time, Car ID, Latitude, Longitude [gps.csv]
  3. Credit Card Transactions: Location, Date, Time, Price, Credit Card No. [cc_data.csv]
  4. Loyalty Card: Location, Date, Price, Loyalty Card No. [loyalty_data.csv]
  5. Employee Information: LastName, FirstName, Gender, CurrentEmployment Title [EmployeeRecords.csv]

Data Wrangling and Preparation

The type of data required would vary with the questions. This section would only cover the main data set up, EDA, quality of data, any general manipulations such as correcting the format of the data.

Special data manipulation specific to the questions would be covered in their respective sections instead.

We first load the datasets, via the read_csv() function.

car_assigned <- read_csv("data/car-assignments.csv")
gpstracking <- read_csv("data/gps.csv")
loyaltycard <- read_csv("data/loyalty_data.csv")
creditcard <- read_csv("data/cc_data.csv")
emprecords <- read_csv("data/EmployeeRecords.csv")

To check for missing values, we use the naniar package. The naniar package provides tidy ways to summarize, visualize and manipulate missing data.

#For car_assigned
miss_var_summary(car_assigned)
# A tibble: 5 x 3
  variable               n_miss pct_miss
  <chr>                   <int>    <dbl>
1 CarID                       9     20.5
2 LastName                    0      0  
3 FirstName                   0      0  
4 CurrentEmploymentType       0      0  
5 CurrentEmploymentTitle      0      0  
#For creditcard
miss_var_summary(creditcard)
# A tibble: 4 x 3
  variable   n_miss pct_miss
  <chr>       <int>    <dbl>
1 timestamp       0        0
2 location        0        0
3 price           0        0
4 last4ccnum      0        0
#For gpstracking
miss_var_summary(gpstracking)
# A tibble: 4 x 3
  variable  n_miss pct_miss
  <chr>      <int>    <dbl>
1 Timestamp      0        0
2 id             0        0
3 lat            0        0
4 long           0        0
#For loyaltycard
miss_var_summary(loyaltycard)
# A tibble: 4 x 3
  variable   n_miss pct_miss
  <chr>       <int>    <dbl>
1 timestamp       0        0
2 location        0        0
3 price           0        0
4 loyaltynum      0        0

Here, we observe that the only dataset with missing values is car_assigned. From the filter below, we see that the truck drivers are not assigned cars. For now, other than to acknowledge this fact, we are indifferent to the missing values as there seems to be no other dataset which contains LastName and FirstName to be able to use the information below. Later, we would see that there is location tracking information on vehicles not within the cars list. These vehicles can be associated with any of the truck drivers below.

For now, these rows can also be removed since we will be doing a join with the other datasets and would require a uniquely valid column without missing values. They will be removed via the complete.cases() function.

car_assigned %>% 
  filter(is.na(CarID))
# A tibble: 9 x 5
  LastName  FirstName CarID CurrentEmploymentTy~ CurrentEmploymentTit~
  <chr>     <chr>     <dbl> <chr>                <chr>                
1 Hafon     Albina       NA Facilities           Truck Driver         
2 Hawelon   Benito       NA Facilities           Truck Driver         
3 Hawelon   Claudio      NA Facilities           Truck Driver         
4 Mies      Henk         NA Facilities           Truck Driver         
5 Morlun    Valeria      NA Facilities           Truck Driver         
6 Morlun    Adan         NA Facilities           Truck Driver         
7 Morluniau Cecilia      NA Facilities           Truck Driver         
8 Nant      Irene        NA Facilities           Truck Driver         
9 Scozzese  Dylan        NA Facilities           Truck Driver         
car_assigned_only <- car_assigned[complete.cases(car_assigned),]

Next, we shall ensure that the Timestamps are in the right and consistent format, using the lubridate package. For the loyaltycard dataset, the timestamp is only in mdy format. We will observe that the timestamp will be converted to POSIXct (for creditcard and gpstracking dataset), and Date (for loyaltycard as there is only date data).

#For creditcard dataset
creditcard$TimeStampFormatted <-mdy_hm(creditcard$timestamp)

#Delete timestamp column
creditcard <- creditcard %>% 
  dplyr::select(-timestamp)

#Reorder columns
col_order <- c("TimeStampFormatted", "location","price","last4ccnum")
creditcard <- creditcard[, col_order]

#For gpstracking dataset
gpstracking$TimeStampFormatted <-mdy_hms(gpstracking$Timestamp)
gpstracking$hour <- hour(gpstracking$TimeStampFormatted)

#Delete timestamp column
gpstracking <- gpstracking %>% 
  dplyr::select(-Timestamp)

#Reorder columns
col_order <- c("TimeStampFormatted", "id","lat","long","hour")
gpstracking <- gpstracking[, col_order]

#For loyaltycard dataset
loyaltycard$TimeStampFormatted <- mdy(loyaltycard$timestamp)

#Delete timestamp column
loyaltycard <- loyaltycard %>% 
  dplyr::select(-timestamp)

#Reorder columns
col_order <- c("TimeStampFormatted", "location","price","loyaltynum")
loyaltycard <- loyaltycard[, col_order]

We also observe special unidentifiable characters in Katerina’s Cafe in the creditcard and loyaltycard dataset. Those shall be identified and replaced using the str_replace_all() function to prevent error in data processing.

#creditcard

creditcard <- creditcard %>% 
  mutate(location = str_replace_all(location,pattern = "Katerin.+",replacement = "Katerinas Cafe"))%>%   mutate(location = str_replace_all(location,pattern = "[^[:alnum:]]",replacement = " "))

#loyaltycard

loyaltycard <- loyaltycard %>% 
  mutate(location = str_replace_all(location,pattern = "Katerin.+",replacement = "Katerinas Cafe"))%>%   mutate(location = str_replace_all(location,pattern = "[^[:alnum:]]",replacement = " "))

Next, we include into the gpstracking dataset, the first and last names of the personnel the car is assigned to. We essentially want to do a left join for the gpstracking dataset, and the car_assigned dataset, by the car ID. We can do this with the left_join() function.

gpsname <- left_join(gpstracking, car_assigned_only, by = c("id" = "CarID"))

#Join First Name and Last Name

gpsname$name <- paste(gpsname$FirstName, gpsname$LastName)

#Reorder cols
col_order <- c("TimeStampFormatted", "id","lat","long","hour","name","LastName","FirstName","CurrentEmploymentType","CurrentEmploymentTitle")
gpsname <- gpsname[, col_order]

Next, we shall prepare the geospatial map for viewing. We would use the Raster package to import the raster file for the map of Abila.

The file to be imported is already georeferenced using qGIS, into .tif format. Note that the raster layer is a three bands false colour image, we would use tm_rgb() instead of tm_raster() to be able to display all three bands. If not, the layer would come out in monochrome.

#Importing the Raster file

bgmap <- raster("data/MC2/MC2-tourist_modified.tif")

#Plotting the Raster Layer and defining as base layer.

tmap_mode("view")

tmain <- tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1, g = 2, b = 3,
            alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255)

Then, we shall map the aspatial data next. Essentially, the following general steps are required to be able to create layers to visualize on the map:

  1. Select the latitude and longitude coordinates that we want displayed. These coordinates are in the .dbl format.

  2. Convert it to Simple Feature Data Frame via the st_as_sf() function of the sf package; Coordinates would be converted to geometry format. They would be input as longitude (‘long’ ; x-coordinates) and latitude (‘lat’ ; y-coordinates), in the EPSG: 4326 format, which is the wgs84 Geographic Coordinate System.

Example:

gps_sf <- st_as_sf(gpstracking,
                   coords = c("long", "lat"),
                   crs = 4326)
  1. We can either use them as individual geometry points, or string them up to form a path

  2. To form a path, we use the st_cast(“LINESTRING”) function

Example:

(Here, we are creating the movement path from GPS points for each car. Hence, we need to group the data by the car ID, the identifier. As R requires a command following the group_by() function, an input will be required for the code to run, so we include a dummy summarize() code to overcome this issue.)

gps_path <- gps_sf %>% 
  group_by(id) %>% 
  summarize(m = mean(TimeStampFormatted),
            do_union = FALSE) %>% 
  st_cast("LINESTRING")

We would like to find out where the vehicles have gone to. First, we have to identify the coordinate points where the vehicles have possibly made stops, and be able to visualize where these stops are on the map. We could decipher this information from the gpstracking data, which tracks the coordinate points of vehicles as long as they are moving. This means that when there is a long gap in the timestamp at a particular coordinate point, it is likely that the vehicle has parked. Stops at traffic light junctions should take no longer than 2 minutes. As such, we shall assume for vehicles which are stationary for more than 4 minutes to be parked (i.e. search for timestamps with lag of more than 4 minutes).

In addition, we note that for position coordinates, the number of decimal places required for a particular accuracy at the equator is:

Source: http://wiki.gis.com/wiki/index.php/Decimal_degrees

Considering the sizes of typical carparks, the accuracy of 1.11m would be too precise. A more likely range could be 11.1m. With this, we shall round all our coordinate points up to 4 decimal points for analysis.

A map with all stops identified from the entire gpstracking dataset is as shown.

#Round lat long points to 4 decimal points

gpstracking$lat <- round(gpstracking$lat, 4)
gpstracking$long <- round(gpstracking$long, 4)

#Identifying all stops for all Car IDs and Time

tmap_mode("view")

all_stops <- gpstracking %>%
  group_by(id) %>%
  mutate(stop = TimeStampFormatted - lag(TimeStampFormatted)) %>%
  mutate(parked = ifelse(stop >60*4, TRUE,FALSE)) %>%
  ungroup() %>%
  filter(parked == TRUE) %>%
  distinct(lat, long)

#Converting it to sf

all_stops_sf <- st_as_sf(all_stops,
                         coords = c("long", "lat"),
                         crs = 4326) %>% 
  mutate(coordinates = geometry)

#Viewing it on the map

tm_all_stops <- tmain +
  tm_shape(all_stops_sf) +
  tm_dots(size = 0.1,
          alpha = 0.3,
          col = "red")

tm_all_stops

With all of the above codes, the following function print_routes_ID_date(emp_id, start_dt,end_dt) is created to generate the routes of a specific vehicle, during a specific time period. This way, we are able to print the routes for a specific vehicle ID, for a selected time period from start_dt* to end_dt.

print_routes_ID_date <- function(emp_id,start_dt,end_dt){
  
  #filter gps_path by ID and datetime
  
  #Start with gpstracking, filter
  
  id_time_select <- gpstracking %>% as_tbl_time(index=TimeStampFormatted) %>%
  filter(id == emp_id) %>% 
  filter_time(start_dt ~ end_dt)
  
  #Convert to selected sf
  
  gps_sf_selected <- st_as_sf(id_time_select,
                         coords = c("long", "lat"),
                         crs = 4326) %>% 
    mutate(coordinates = geometry)
  
  #Create a LineString
  
    gps_path_selected <- gps_sf_selected %>% 
  group_by(id) %>% 
  summarize(m = mean(TimeStampFormatted),
            do_union = FALSE) %>% 
  st_cast("LINESTRING")
  
    #filter stop points by ID and datetime
  
  stops <- gpstracking %>% as_tbl_time(index=TimeStampFormatted) %>%
  group_by(id) %>%
  filter(id == emp_id) %>% 
  filter_time(start_dt ~ end_dt) %>% 
  mutate(stop = TimeStampFormatted - lag(TimeStampFormatted)) %>%
  mutate(parked = ifelse(stop >60*4, TRUE,FALSE)) %>%
  ungroup() %>%
  filter(parked == TRUE) %>%
  distinct(lat, long)
    
  #Converting stop points to sf
  
  stops_sf <- st_as_sf(stops,
                         coords = c("long", "lat"),
                         crs = 4326) 
  
  #Viewing it on the map
  
  mapviz <- tm_all_stops +
    tm_shape(gps_path_selected) +
    tm_lines() +
    tm_shape(stops_sf) +
    tm_dots(size = 0.1,
            alpha = 0.3,
            col = "green")
  
  return(mapviz)
  
}

Example of printing the route for ID #15, on 7th Jan.

print_routes_ID_date(15,'2014-01-07','2014-01-07')

Questions

Question 1

Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend to correct these anomalies?

From the bar plot above, by credit card transactions, Katerina’s Cafe, Hippokampos, Guy’s Gyros and Brew’ve Been Served are the most popular locations. We shall see the outcome when the loyalty card transactions are analyzed in the same manner. From the bar plot below, the same results are derived for the Top 4 most popular locations.

However, the bar plots only tell us the total number of visits over the 2 weeks. We do not know which days have higher visits, or the profile of the visits over the day. To view this, we can use violin plots. The thickness of the plots will also reflect the frequency of the visit at that point in time. We will also be able to see which days did the visits take place.

We shall use the violin plots in the plotly package as they are interactive, we can hover over the plots, especially where they are thicker to see what time periods are those.

From the violin plot above, we can see that the top 3 frequented locations identified above (Katerina’s Cafe, Hippokampos, Guy’s Gyros, Brew’ve been Served) have a pretty consistent visit rate daily throughout the 2 weeks from 6th to 20th Jan. The thickness of the violin plot is consistent. This means that they are popular throughout the 2 weeks.

However, it is observed that other locations can have higher visiting rates at specific periods than the identified Top 3. For example, the visiting rates for Desaflo Golf Course was much higher specifically on 19th Jan, same for Kronos Pipe and Irrigation on 7th Jan. Hallowed Grounds also had peak visiting rates on 8th and 15th Jan.

The violin plot showed a good overview that the visiting days vary across the locations. Now we shall dissect this further into day and time.

We shall use the geom_tile plot from ggplot to visualize this, wrapped with ggplotly() to provide us with interactivity as we would be able to hover over each square and see the corresponding location and date/time. One tile plot would be for “Visits at Each Location for Each Day” and “Visits at Each Location for Time of Day”.

We would first prepare the dataset for this plot by extracting the date and time data into different columns, then do a count for them respectively. As the timestamp has been formatted into POSIXct format earlier, R already stores it in dttm (date-time) format. We can just use the as_date and as_hms functions from the lubridate package to extract them out. We should retain them in date and time format for easier manipulation later.

glimpse(creditcard)

class(creditcard$TimeStampFormatted)

creditcard$date <- as_date(creditcard$TimeStampFormatted)

creditcard$time <- hms::as_hms(creditcard$TimeStampFormatted)

Next, we would count the number of visits to each location for each Day and time period. While the Date variable is more discrete and easier to count, more work needs to be done to count for the Time variable as it would not make sense to count how many visits are there when the Time value is detailed to the second. Hence, we will round the timestamp down to the nearest half hour for this count, via the round_date() function.

#Count by Day
cardbyday <- creditcard %>% 
  group_by(date) %>% 
  add_count(date, location) %>% 
  distinct(date, location, n)

loyalcardbyday <- loyaltycard %>% 
  group_by(TimeStampFormatted) %>% 
  add_count(TimeStampFormatted, location) %>% 
  distinct(TimeStampFormatted, location, n)

#Count by Time
#Round the time to the nearest half hour

cardbytime <- creditcard %>% 
  group_by(thirtymins = hms::as_hms(round_date(TimeStampFormatted, "30 mins"))) %>% 
  add_count(thirtymins, location) %>% 
  distinct(thirtymins, location, n)

With the data prepared, we shall do the tile plot via geom_tile() from ggplot, then wrapped by ggplotly() for interactivity.

This would be for the “Visits at each Location for Each Day (Credit Card)”.

This is the same plot plotted via loyalty card data.

Next, this would be for “Visits at Each Location for Time of Day”, which can only come from the credit card transactions dataset as transaction time information is not present in loyalty card dataset.

From the plots above, we could see that Katerina’s Cafe, Hippokampos, Guy’s Gyros and Brew’ve Been Served have the lightest bars across the 2 weeks. This shows that they are consistently popular. We also note that Guy’s Gyros is more popular during weekdays than weekends, and Brew’ve Been Served only has visits on weekdays, indicating that it is probably not open on weekends.

No. Location Periods Popular Remarks
1 Katerina’s Cafe All Days ; Lunchtime (1300 hrs - 1430 hrs), Dinnertime (1900 hrs - 2130 hrs) NIL
2 Hippokampos All Days ; Lunchtime (1230 hrs - 1430 hrs), Dinnertime (1930 hrs - 2230 hrs) NIL
3 Guy’s Gyros More Popular During Weekdays ; Lunchtime (1300 hrs - 1430 hrs), Dinnertime (1900 hrs - 2130 hrs) Unusually crowded on 19 Jan (Sun) compared to the previous Sunday and weekend crowd
4 Brew’ve Been Served Weekday Mornings ; Morning Coffee (0730 hrs - 0830 hrs) Huge Crowd at 0800 hrs ; Seems closed on weekends

Anomalies Detected

Hippokampos is nowhere to be found on the map of Abila

We note that the above popular locations are all identified on the map of Abila, and are located near Gastech, all except for Hippokampos, which is nowhere to be found on the map. This is an anomaly. To be able to solve the mystery of where Hippokampos is, we would have to delve into other data like vehicle tracking. We would need to match transaction timings with gps tracking data, the coordinates matching at that time would tell us the position of Hippokampos.

Awfully infrequent visits at refuel stations U Pump and Frank’s Fuel

U Pump is a refuel station, yet it is only visited twice from the credit card transactions, on two separate days - 6th and 13th Jan. The frequency of refuel is unusually infrequent. It could be due to employees not needing to pay for the fuel using their own credit cards. But strange thing is the loyalty card is also not used either. The same goes for Frank’s Fuel, there were only transactions on 2 days - 8th and 18th Jan. This is just an abnormal situation, which we hope to seek further insights later. Or it could be that there were visits, but purchases were neither paid by credit card nor loyalty card was applied. This would be suspicious, as though the purchasers wanted to keep their tracks secret. We can similary look at gps tracking data to see if there were stops at the fuel stations.

Unusually infrequent visits for supermarkets

Supermarkets should have daily transactions. However, according to credit card transaction data, Kronos Mart and General Grocer can have 3 - 5 days without any transactions within the 2 weeks. Maximum number of transactions only goes up to 3 for Kronos Mart, and 2 for General Grocer for one day each. Business seems awfully bad.

When we compare this with loyalty card transaction data, there are still days without any transactions, but we see the following differences:

Date Credit Card Data Loyalty Card Data
9 Jan No transaction 1 transaction
10 Jan 1 transaction No transaction
12 Jan 1 transaction 2 transactions
13 Jan 2 transactions No transaction
14 Jan 1 transaction No transaction
15 Jan No transaction 1 transaction
17 Jan 1 transaction No transaction
18 Jan No transaction 3 transactions
19 Jan 3 transactions No transaction

This means that there are transactions paid by cash, not paid by credit card. There were also transactions made without using loyalty card. But business still seems abnormally bad. We would use gpstracking data to see if there were actually any visits to the supermarkets, just that transactions were paid in cash.

High Transaction Amounts

Here, we see that the transaction with the highest amount of $10,000 is made at Frydos Autosupply n More. It is also very different from its normal range, an extreme outlier. We shall note this point.

Extreme outliers are also detected at

We shall also note the highest value purchases which are greater than $4000, with the exception of the highest transaction at Stewart and Sons Fabrication, as that is near to its normal range.

Details of these transactions are tabulated as follows:

We note that owner of credit card 4530 (and loyalty card L8477) and 2276 (and loyalty card L3317) own two of these high transaction amounts each. It is also strange that the purchaser behind the huge transaction of $4918.39 at Abila Airport did not use his/her loyalty card to get further discounts on such a big ticket purchase.

Question 2

Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find?

Anomaly: Hippokampos nowhere to be found on Abila Map

# Add time column to gpstracking to enable filtering by time periods and date

gpstracking$time <- hms::as_hms(gpstracking$TimeStampFormatted)
gpstracking$hour <- hour(gpstracking$TimeStampFormatted)
gpstracking$date <- as_date(gpstracking$TimeStampFormatted)
# Filter gpstracking by time period

hippostops <- gpstracking %>% 
  filter(hour >=22, hour <= 23) %>%
  group_by(id) %>% 
  mutate(stop = TimeStampFormatted - lag(TimeStampFormatted)) %>%
  mutate(parked = ifelse(stop >60*3, TRUE,FALSE)) %>%
  ungroup() %>%
  filter(parked == TRUE) %>% mutate(count = n()) %>% 
  dplyr::select(c(id, lat, long, TimeStampFormatted, hour))

#Visualizing on map

  #Converting stop points to sf
  
  hippostops_sf <- st_as_sf(hippostops,
                         coords = c("long", "lat"),
                         crs = 4326) %>% 
    mutate(coordinates = geometry)
  
  #Viewing it on the map
  
  hippoviz <- tm_all_stops +
    tm_shape(hippostops_sf) +
    tm_dots(size = 0.1,
            alpha = 0.3,
            col = "green")
  
  hippoviz

Anomaly: Awfully Infrequent Visits at Refuel Stations

Anomaly: Unusually Infrequent Visits at Supermarkets

# A tibble: 442 x 3
# Groups:   LastName, date [442]
   LastName       date       namecountperday
   <chr>          <date>               <int>
 1 Vasco-Pais     2014-01-06            1171
 2 Barranco       2014-01-06            2110
 3 Frente         2014-01-06            3101
 4 Campo-Corrente 2014-01-06            1536
 5 Orilla         2014-01-06            3029
 6 Calzas         2014-01-06            2342
 7 Strum          2014-01-06            2175
 8 Vann           2014-01-06            3688
 9 Flecha         2014-01-06            1042
10 Cocinaro       2014-01-06            1675
# ... with 432 more rows
test <- gpsname %>% filter(id > 100) %>% 
  group_by(id) %>% 
  summarize(m = mean(TimeStampFormatted), do_union = FALSE)

test %>% dplyr::select(-m)
# A tibble: 5 x 2
     id do_union
  <dbl> <lgl>   
1   101 FALSE   
2   104 FALSE   
3   105 FALSE   
4   106 FALSE   
5   107 FALSE   
test
# A tibble: 5 x 3
     id m                   do_union
  <dbl> <dttm>              <lgl>   
1   101 2014-01-12 13:20:28 FALSE   
2   104 2014-01-14 04:16:25 FALSE   
3   105 2014-01-14 06:41:18 FALSE   
4   106 2014-01-14 06:09:40 FALSE   
5   107 2014-01-13 22:49:11 FALSE   

From the above plot, we see that most employees, except for Sanjorge Jr Sten, traveled daily. Sanjorge Jr Sten is the CEO of Gastech, and was likely chauffeured instead of driving on his own. We also notice that there is a high count of travels on weekdays with LastName = NA. Checking on the gpsname data, we find that there is gps tracking information for vehicle IDs 101, 104, 105, 106 and 107, which do not exist in the car_assigned dataset, this is why LastName was NA.

(This chunk onwards KIV)

Tracking of the route for vehicle ID 105 and 107 are as below.

Route of Vehicle ID 105
Route of Vehicle ID 107

The route shows that these vehicles comes out from GasTech, goes round nearby areas, and would then go back to GasTech. 105 would go to Carlyle Chemical Inc., Nationwide Refinery, and Guy’s Gyros. 107 would also travel to Maximum Iron and Steel often. Since both vehicles only move on weekdays, they are likely the company trucks, which are not assigned to a specific employee. The personnel who can use these vehicles are as tabulated below:

# A tibble: 9 x 5
  LastName  FirstName CarID CurrentEmploymentTy~ CurrentEmploymentTit~
  <chr>     <chr>     <dbl> <chr>                <chr>                
1 Hafon     Albina       NA Facilities           Truck Driver         
2 Hawelon   Benito       NA Facilities           Truck Driver         
3 Hawelon   Claudio      NA Facilities           Truck Driver         
4 Mies      Henk         NA Facilities           Truck Driver         
5 Morlun    Valeria      NA Facilities           Truck Driver         
6 Morlun    Adan         NA Facilities           Truck Driver         
7 Morluniau Cecilia      NA Facilities           Truck Driver         
8 Nant      Irene        NA Facilities           Truck Driver         
9 Scozzese  Dylan        NA Facilities           Truck Driver         

We also notice that there was unusually high movement for the trucks on 16 Jan - 26116 gps points/day, compared to the usual 3000 - 6000 points/day on the other days within these 2 weeks. Further investigation will be covered in Question 5.

Next, we will use the function below to do a plot of travel time periods on each day, for each personnel.

# A tibble: 6 x 28
  TimeStampFormatted     id   lat  long  hour name  LastName FirstName
  <dttm>              <dbl> <dbl> <dbl> <int> <chr> <chr>    <chr>    
1 2014-01-06 06:28:01    35  36.1  24.9     6 Will~ Vasco-P~ Willem   
2 2014-01-06 06:28:01    35  36.1  24.9     6 Will~ Vasco-P~ Willem   
3 2014-01-06 06:28:03    35  36.1  24.9     6 Will~ Vasco-P~ Willem   
4 2014-01-06 06:28:05    35  36.1  24.9     6 Will~ Vasco-P~ Willem   
5 2014-01-06 06:28:06    35  36.1  24.9     6 Will~ Vasco-P~ Willem   
6 2014-01-06 06:28:07    35  36.1  24.9     6 Will~ Vasco-P~ Willem   
# ... with 20 more variables: CurrentEmploymentType.x <chr>,
#   CurrentEmploymentTitle.x <chr>, date <date>, time <time>,
#   BirthDate <chr>, BirthCountry <chr>, Gender <chr>,
#   CitizenshipCountry <chr>, CitizenshipBasis <chr>,
#   CitizenshipStartDate <chr>, PassportCountry <chr>,
#   PassportIssueDate <chr>, PassportExpirationDate <chr>,
#   CurrentEmploymentType.y <chr>, CurrentEmploymentTitle.y <chr>,
#   CurrentEmploymentStartDate <chr>, EmailAddress <chr>,
#   MilitaryServiceBranch <chr>, MilitaryDischargeType <chr>,
#   MilitaryDischargeDate <chr>

*We see that there are frequent visits to the supermarkets and pump refuel stations. This means that there were transactions and business, but the transactions were not paid in credit card and loyalty cards were not used either.

Question 3

Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data?

#Prep Data first by joining loyalty card number into credit card dataset

credit_loyalty <- full_join(creditcard, loyaltycard, by = c("price" = "price", "date" = "TimeStampFormatted", "location" = "location"))

#To see if credit card records and loyalty card pairs are unique

creditloyaltyuniq <- 
  credit_loyalty %>% 
  group_by(last4ccnum, loyaltynum) %>% 
  summarize(Count = n())

creditloyaltyuniq
# A tibble: 169 x 3
# Groups:   last4ccnum [56]
   last4ccnum loyaltynum Count
        <dbl> <chr>      <int>
 1       1286 L3288         15
 2       1286 L3572         13
 3       1286 <NA>           4
 4       1310 L8012         21
 5       1310 <NA>          12
 6       1321 L4149         22
 7       1321 <NA>          10
 8       1415 L7783         24
 9       1415 <NA>           7
10       1874 L4424         25
# ... with 159 more rows
howmany <- loyaltycard %>% distinct(loyaltynum)
howmany
# A tibble: 54 x 1
   loyaltynum
   <chr>     
 1 L2247     
 2 L9406     
 3 L8328     
 4 L6417     
 5 L1107     
 6 L4034     
 7 L6110     
 8 L2343     
 9 L9018     
10 L6267     
# ... with 44 more rows
figD <- creditloyaltyuniq %>% 
  ggplot(aes(x = last4ccnum, y = loyaltynum)) +
  geom_tile(aes(fill = Count)) +
  scale_fill_distiller(palette = "YlGnBu", direction = -1) +
  labs(title = "No. of Purchases on Each Credit Card and Loyalty Card Pair", x = "Credit Card No.", y = "Loyalty Card No.")

ggplotly(figD)
#figE <- sankeyNetwork(Links = 
  
#)

This is the anomaly:

One loyalty card associated with multiple credit cards:

L9406 associated with 4948 (22) and 5921 (1). L8566 associated with 4795 (25) and 8332 (1). L6119 associated with 7889 (20) and 5368 (1). L3295 associated with 4948 (1) and 5921 (12). L3288 associated with 1286 (15) and 9241 (13). L2247 associated with 5368 (24) and 7889 (1). L2070 associated with 4795 (1) and 8332 (27).

One credit card associated with multiple loyalty cards:

1286 associated with L3288 (15) and L3572 (13) 4795 associated with L2070 (1) and L8566 (25) 4948 associated with L3295 (1) and L9406 (22) 5368 associated with L2247 (24) and L6119 (1) 5921 associated with L3295 (12) and L9406 (1) [Same pattern as 4948! Opposite!] 7889 associated with L2247 (1) and L6119 (20) [Same pattern with 5368! Opposite!] 8332 associated with L2070 (27) and L8566 (1) [Same pattern as 4795!]

fig <- plot_ly(
    type = "sankey",
    orientation = "h",

    node = list(
      label = c("A1", "A2", "B1", "B2", "C1", "C2"),
      color = c("blue", "blue", "blue", "blue", "blue", "blue"),
      pad = 15,
      thickness = 20,
      line = list(
        color = "black",
        width = 0.5
      )
    ),

    link = list(
      source = c(0,1,0,2,3,3),
      target = c(2,3,3,4,4,5),
      value =  c(8,4,2,8,4,2)
    )
  )
fig <- fig %>% layout(
    title = "Basic Sankey Diagram",
    font = list(
      size = 10
    )
)

Question 4

Given the data sources provided, identify potential informal or unofficial relationships among GASTech personnel. Provide evidence for these relationships.

  1. Chostus Hotel transaction at 12noon, then 1-2pm in the afternoon. These are the usual hotel checkout timings. We can track who made these transactions.

  2. Find out who went to whose home at night

Question 5

Do you see evidence of suspicious activity? Identify 1- 10 locations where you believe the suspicious activity is occurring, and why

  1. Unusually high movement activity for truck on 16 Jan 2014.

  2. 1 visit to Daily Dealz on 13 Jan 2014, at 6am in the morning.

  3. Sanjorge Jr. only had 3 day of GPS records from 17th to 19th Jan, while everyone else travelled almost daily.

  4. Stolen credit card?

  5. L3288 associated with 1286 (15) and 9241 (13).1286 associated with L3288 (15) and L3572 (13)

  6. 5921 associated with L3295 (12) and L9406 (1) [Same pattern as 4948! Opposite!]. 4948 associated with L3295 (1) and L9406 (22)

  7. 7889 associated with L2247 (1) and L6119 (20) [Same pattern with 5368! Opposite!] 5368 associated with L2247 (24) and L6119 (1)

  8. 8332 associated with L2070 (27) and L8566 (1) [Same pattern as 4795!] 4795 associated with L2070 (1) and L8566 (25)

  9. Guy’s Gyros is unusually crowded on 19 Jan (Sun) compared to the previous Sunday and weekend crowd.

  10. Supermarket transactions in the wee hours, with high transaction amounts

The above plot also shows that there were 5 transactions made at Kronos Mart in the wee hours between 0300 - 0400 hrs. The 5 transactions are as shown in the table below. We also observe that the transaction amounts are unusually high, for a midnight purchase.

kronos <- creditcard %>% filter(location == "Kronos Mart") %>% 
  arrange(time)  %>%  slice_min(time,n = 6)

kronos
# A tibble: 6 x 6
  TimeStampFormatted  location    price last4ccnum date       time  
  <dttm>              <chr>       <dbl>      <dbl> <date>     <time>
1 2014-01-13 03:00:00 Kronos Mart 147.        5407 2014-01-13 03:00 
2 2014-01-19 03:13:00 Kronos Mart  87.7       3484 2014-01-19 03:13 
3 2014-01-12 03:39:00 Kronos Mart 277.        8156 2014-01-12 03:39 
4 2014-01-19 03:45:00 Kronos Mart 195.        9551 2014-01-19 03:45 
5 2014-01-19 03:48:00 Kronos Mart 150.        8332 2014-01-19 03:48 
6 2014-01-16 07:30:00 Kronos Mart 299.        7108 2014-01-16 07:30 
  1. High transaction amounts

Plan

Question 1

Question 2

Question 3

Question 4

Packages to review

Question 5